This notebook relates to the TensorFlow Speech Commands Dataset. The TensorFlow Speech Commands dataset is a set of one-second .wav audio files, each containing a single spoken English word. These words come from a small set of commands and are spoken by a variety of different speakers. The dataset was designed for limited-vocabulary speech recognition tasks and can be obtained for free from the IBM Developer Data Asset Exchange.
In this notebook, we will visualize, edit, and compare sample audio files that were saved by the previous notebook.
Before you run this notebook, complete the following steps:
When you import this project from the Watson Studio Gallery, a token should be automatically generated and inserted at the top of this notebook as a code cell such as the one below:
# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='YOUR_PROJECT_ID', project_access_token='YOUR_PROJECT_TOKEN')
pc = project.project_context
If you do not see the cell above, follow these steps to enable the notebook to access the dataset from the project's resources:
Click More -> Insert project token in the top-right menu section. This should insert a cell at the top of this notebook similar to the example given above.
If an error is displayed indicating that no project token is defined, follow these instructions.
Run the newly inserted cell before proceeding with the notebook execution below.
# Import required libraries
import pandas as pd
import io
# Math
import numpy as np
from scipy.fftpack import fft
from scipy import signal
from scipy.io import wavfile
from sklearn.decomposition import PCA
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import IPython.display as ipd
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
Check data assets in this project.
# Extract a sorted list of all assets associated with this project
file_names = sorted([d['name'] for d in project.get_files()])
file_names
len(file_names)
Sample rate is how frequently samples are taken. It is measured in “samples per second” and is usually expressed in kilohertz (kHz), a unit meaning 1,000 times per second. Audio CDs, for example, have a sample rate of 44.1 kHz, which means that the analog signal is sampled 44,100 times per second. If the audio sample rate is 16 kHz, then the analog signal is sampled 16,000 times per second.
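The relationship between sample rate, sample count, and clip length can be sketched in a few lines. The helper names below are hypothetical and are not used elsewhere in this notebook.

```python
def clip_duration_seconds(num_samples, sample_rate):
    """Return the clip length in seconds for a sample count and a sample rate."""
    return num_samples / sample_rate


def samples_needed(duration_seconds, sample_rate):
    """Return how many samples are needed to cover a duration at a sample rate."""
    return int(round(duration_seconds * sample_rate))


# A clip with 16,000 samples recorded at 16 kHz lasts one second
one_second_clip = clip_duration_seconds(16000, 16000)

# One second of CD-quality audio needs 44,100 samples per channel
cd_samples = samples_needed(1.0, 44100)

print(one_second_clip, cd_samples)
```

This is why the one-second clips in this dataset, recorded at 16 kHz, hold about 16,000 samples each.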
sample_rate, samples = wavfile.read(project.get_file('bird_0a7c2a8d_nohash_0.wav'))
print('Audio Sample Rate', sample_rate)
Let's create a function that calculates the spectrogram of the raw audio files. A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time. We will use a log scale for the spectrogram values, which makes them much easier to plot and compare. Because the log of zero is undefined, we add a small offset (eps) to every value before taking the log.
The inputs of this function are the samples extracted from the wav file, the sample rate, the size of the frame in milliseconds, the step (stride or skip) size in milliseconds, and a small offset. The outputs follow the SciPy manual: the log_specgram function returns three values, namely an array of sample frequencies, an array of segment times, and the log of the spectrogram of the input, transposed so that each row is one time segment.
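As a rough illustration of how the millisecond arguments translate into samples, here is a minimal sketch. The helper names are hypothetical, and the frame count assumes that no padding is applied at the clip edges. SciPy's noverlap argument counts the samples shared by consecutive frames, so it equals the window length minus the step.

```python
def frame_parameters(sample_rate, window_ms=20, step_ms=10):
    """Convert window and step sizes from milliseconds to sample counts."""
    nperseg = int(round(window_ms * sample_rate / 1e3))  # samples per frame
    step = int(round(step_ms * sample_rate / 1e3))       # hop between frames
    noverlap = nperseg - step                            # samples shared by neighbours
    return nperseg, step, noverlap


def num_frames(num_samples, nperseg, step):
    """Number of complete frames that fit in the clip (no edge padding assumed)."""
    if num_samples < nperseg:
        return 0
    return 1 + (num_samples - nperseg) // step


nperseg, step, noverlap = frame_parameters(16000)
frames_in_one_second = num_frames(16000, nperseg, step)
print(nperseg, step, noverlap, frames_in_one_second)
```

With the default 20 ms window and 10 ms step, the step and the overlap happen to be the same number of samples; for other settings they differ.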
We rescale the spectrogram with the log function for the sake of calculation and visualization. Spectrogram values can span several orders of magnitude, and we don't want the largest ones to dominate the computation. Taking the log compresses the differences between large and small values while preserving their order.
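A small numeric sketch of this effect, using made-up power values rather than values from the dataset:

```python
import numpy as np

# Hypothetical power values spanning many orders of magnitude, including an exact zero
power = np.array([0.0, 1e-6, 1e-3, 1.0, 1e3])
eps = 1e-10  # small offset so that the log of zero is never taken

log_power = np.log(power + eps)

# The raw values differ by a factor of a billion or more; the log values
# stay within a few tens of units, and their order is unchanged.
print(log_power)
```

The offset only matters for values close to zero; for larger values it has a negligible effect on the result.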
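def log_specgram(audio, sample_rate, window_size=20,
                 step_size=10, eps=1e-10):
    # Convert the window and step sizes from milliseconds to samples
    nperseg = int(round(window_size * sample_rate / 1e3))
    step = int(round(step_size * sample_rate / 1e3))
    # scipy expects the overlap between consecutive windows, not the step
    noverlap = nperseg - step
    freqs, times, spec = signal.spectrogram(audio,
                                            fs=sample_rate,
                                            window='hann',
                                            nperseg=nperseg,
                                            noverlap=noverlap,
                                            detrend=False)
    # Transpose to (time, frequency) and add eps so that log(0) never occurs
    return freqs, times, np.log(spec.T.astype(np.float32) + eps)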
freqs, times, spectrogram = log_specgram(samples, sample_rate)
data = [go.Surface(z=spectrogram.T)]
layout = go.Layout(
    title='Spectrogram of "bird" in 3D',
    scene=dict(
        yaxis=dict(title='Frequency'),
        xaxis=dict(title='Time'),
        zaxis=dict(title='Log amplitude'),
    ),
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)
There are two main properties of a regular vibration - the amplitude and the frequency - which affect the way it sounds.
Amplitude is the size of the vibration, which determines how loud the sound is. The larger the size of vibrations, the louder the sound. Amplitude is important when balancing and controlling the loudness of sounds, such as with the volume control on your computer.
Frequency is the speed of the vibration, which determines the pitch of the sound. The faster the speed of the vibrations, the higher the tone.
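To make the two properties concrete, the following sketch generates two synthetic tones and measures them. The tones, helper names, and parameter values are illustrative assumptions and are unrelated to the dataset.

```python
import numpy as np


def sine_wave(frequency_hz, amplitude, sample_rate=16000, duration_s=1.0):
    """Generate a pure tone with the given frequency and amplitude."""
    t = np.arange(int(sample_rate * duration_s)) / sample_rate
    return amplitude * np.sin(2 * np.pi * frequency_hz * t)


def count_cycles(wave):
    """Count upward zero crossings, which occur once per cycle."""
    return int(np.sum((wave[:-1] < 0) & (wave[1:] >= 0)))


quiet_low = sine_wave(220, 0.2)   # small vibration, slow: quiet and low-pitched
loud_high = sine_wave(880, 0.8)   # large vibration, fast: loud and high-pitched

for name, wave in [('quiet_low', quiet_low), ('loud_high', loud_high)]:
    print(name, 'peak amplitude:', np.max(np.abs(wave)),
          'cycles in one second:', count_cycles(wave))
```

The peak value tracks the amplitude argument, and the number of cycles per second tracks the frequency argument.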
# Visualize an audio clip
fig = plt.figure(figsize=(14, 8))
ax1 = fig.add_subplot(111)
ax1.set_title('Raw wave of bird')
ax1.set_xlabel('Seconds')
ax1.set_ylabel('Amplitude')
# Time axis: one point per sample, spanning the clip duration in seconds
ax1.plot(np.linspace(0, len(samples) / sample_rate, len(samples)), samples)
# Create a spectrogram of audio clip
fig = plt.figure(figsize=(14, 8))
ax2 = fig.add_subplot(111)
ax2.imshow(spectrogram.T, aspect='auto', origin='lower',
           extent=[times.min(), times.max(), freqs.min(), freqs.max()])
ax2.set_yticks(freqs[::16])
ax2.set_xticks(times[::16])
ax2.set_title('Spectrogram of bird')
ax2.set_ylabel('Freqs in Hz')
ax2.set_xlabel('Seconds')
In the 3-D figure above, the amplitude increases significantly after about 0.5 seconds. This means the spoken word starts around that time and the clip begins with a gap of silence. Let's listen to the audio in the next section.
In this section, we will edit the audio sample bird.wav by removing silence and resampling the audio.
In the previous visualization, we saw a gap of silence at the beginning of the audio sample. We want to shorten the sound file by cutting out the silent part. Let's listen to the original "bird" sound file first.
ipd.Audio(samples, rate=sample_rate)
Let's cut a bit of the file at the beginning and at the end, and listen to it again. Based on the amplitude plot above, the sound lasts from about 0.43 seconds to 0.9 seconds. At 16,000 samples per second, 0.43*16000 = 6880 and 0.9*16000 = 14400. We trim slightly inside that range and keep samples 7000 to 14000.
samples_cut = samples[7000:14000]
ipd.Audio(samples_cut, rate=sample_rate)
Checking on the trimmed audio, we can agree that the entire word can be heard.
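The conversion from a time in seconds to a sample index, which the cut above relies on, can be sketched as follows. The helper name is hypothetical.

```python
def seconds_to_index(t_seconds, sample_rate):
    """Convert a time in seconds to the nearest sample index."""
    return int(round(t_seconds * sample_rate))


sample_rate = 16000
start = seconds_to_index(0.43, sample_rate)  # 6880
end = seconds_to_index(0.9, sample_rate)     # 14400

# Length in seconds of a clip that keeps samples 7000 to 14000
trimmed_duration = (14000 - 7000) / sample_rate
print(start, end, trimmed_duration)
```

Keeping samples 7000 to 14000 therefore leaves a clip of a little under half a second.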
Next, we want to visualize the trimmed audio. Voice activity detection (VAD), a technique that detects the presence or absence of human speech, is useful here. Even though the words are short, the recordings still contain a lot of silence. The detection can also be used to trigger a process.
A good VAD can reduce the size of the training data considerably, which speeds up training. Feel free to explore more. Reference: Voice Activity Detection
It is impractical to cut all the files manually based on a simple plot, but the webrtcvad package provides a good VAD.
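As a rough illustration of the idea, here is a minimal energy-threshold sketch. It is a much simpler technique than the model used by webrtcvad and is not that package's API. The function name, frame length, and threshold are assumptions chosen for the example, and the input is a synthetic clip rather than a file from the dataset.

```python
import numpy as np


def energy_vad(samples, sample_rate, frame_ms=20, rel_threshold=0.1):
    """Return (start, end) sample indices of the region whose frame energy
    exceeds a fraction of the loudest frame, or None for a silent clip."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = np.asarray(samples[:n_frames * frame_len], dtype=np.float64)
    frames = frames.reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))  # loudness of each frame
    if rms.max() == 0:
        return None
    active = np.flatnonzero(rms >= rel_threshold * rms.max())
    return int(active[0] * frame_len), int((active[-1] + 1) * frame_len)


# Synthetic one-second clip: 0.4 s of silence, 0.5 s of tone, 0.1 s of silence
sample_rate = 16000
t = np.arange(8000) / sample_rate
clip = np.concatenate([np.zeros(6400), np.sin(2 * np.pi * 440 * t), np.zeros(1600)])

start, end = energy_vad(clip, sample_rate)
trimmed = clip[start:end]
print(start, end, len(trimmed))
```

A plain energy threshold like this is easily fooled by background noise, which is one reason to prefer a dedicated VAD on real recordings.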
Let's plot the audio sample together with a guessed alignment of the 'b', 'ir', and 'd' graphemes.
freqs, times, spectrogram_cut = log_specgram(samples_cut, sample_rate)
fig = plt.figure(figsize=(14, 4))
ax1 = fig.add_subplot(111)
ax1.set_title('Raw Wave of bird sample')
ax1.set_ylabel('Amplitude')
ax1.plot(samples_cut)
fig = plt.figure(figsize=(14, 4))
ax2 = fig.add_subplot(111)
ax2.set_title('Spectrogram of bird sample')
ax2.set_ylabel('Freqs in Hz')
ax2.set_xlabel('Seconds')
ax2.imshow(spectrogram_cut.T, aspect='auto', origin='lower',
           extent=[times.min(), times.max(), freqs.min(), freqs.max()])
ax2.set_yticks(freqs[::16])
ax2.set_xticks(times[::16])
ax2.text(0.075, 1000, 'B', fontsize=18)
ax2.text(0.16, 1000, 'IR', fontsize=18)
ax2.text(0.27, 1000, 'D', fontsize=18)
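# Mark the guessed grapheme boundaries (in seconds) on both plots
xcoords = [0.05, 0.1, 0.23, 0.312]
for xc in xcoords:
    ax1.axvline(x=xc * sample_rate, c='r')  # ax1 is indexed by sample number
    ax2.axvline(x=xc, c='r')                # ax2 is indexed by seconds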
Resampling recordings is another way to reduce the dimensionality of data.
Most speech-related frequencies lie in a narrow band. GSM (2G wireless communication) samples the signal at 8,000 Hz, and people can still understand one another when talking on the telephone.
Resampling the dataset from 16 kHz to 4 kHz reduces the size of the data by a factor of four. We will perform resampling in this section.
SciPy's signal.resample uses the Fourier method, and we will also use the FFT (Fast Fourier Transform) to compare the original and resampled signals.
A fast Fourier transform (FFT) is an algorithm that computes the discrete Fourier transform (DFT) of a sequence, or its inverse (IDFT). Reference: FFT Wiki
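To illustrate that the FFT computes the same result as the DFT definition, only faster, the sketch below evaluates the DFT sum directly and compares it with NumPy's FFT. The function name naive_dft and the random test signal are assumptions made for the example.

```python
import numpy as np


def naive_dft(x):
    """Evaluate the DFT definition directly: X[k] = sum_m x[m] * exp(-2*pi*i*k*m/n).
    This costs O(n^2) operations, whereas an FFT needs O(n log n)."""
    x = np.asarray(x, dtype=complex)
    n = x.shape[0]
    k = np.arange(n)
    w = np.exp(-2j * np.pi * np.outer(k, k) / n)  # n-by-n matrix of complex exponentials
    return w @ x


rng = np.random.default_rng(0)
signal_in = rng.standard_normal(64)

dft_result = naive_dft(signal_in)
fft_result = np.fft.fft(signal_in)
roundtrip = np.fft.ifft(fft_result).real  # the inverse transform recovers the input

print(np.allclose(dft_result, fft_result), np.allclose(roundtrip, signal_in))
```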
The human ear processes sound in a way that resembles a Fourier transform. Our ears convert sound (waves of pressure traveling over time through the atmosphere) into a spectrum, a description of the sound as a series of volumes at distinct pitches. The brain then turns this information into perceived sound.
The Fast Fourier Transform (FFT) is calculated below:
def custom_fft(y, fs):
    T = 1.0 / fs  # sampling interval in seconds
    N = y.shape[0]
    yf = fft(y)
    xf = np.linspace(0.0, 1.0/(2.0*T), N//2)
    # The FFT of a real signal is symmetrical, so we take just the first half
    vals = 2.0/N * np.abs(yf[0:N//2])
    return xf, vals
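As a sanity check on the scaling used for a one-sided spectrum, here is an alternative sketch built on NumPy's rfft and rfftfreq rather than on the function above. The name one_sided_spectrum and the test tone are assumptions. With the 2/N scaling, a pure sine of amplitude A shows up as a peak of height A at its own frequency.

```python
import numpy as np


def one_sided_spectrum(y, fs):
    """Return the bin frequencies in Hz and the amplitude of each frequency component."""
    y = np.asarray(y, dtype=np.float64)
    n = y.shape[0]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)  # exact centre frequency of every bin
    amps = np.abs(np.fft.rfft(y)) / n
    amps[1:] *= 2.0        # fold the mirrored negative frequencies into the positive half
    if n % 2 == 0:
        amps[-1] /= 2.0    # the Nyquist bin has no mirror image, so undo the doubling
    return freqs, amps


fs = 16000
t = np.arange(fs) / fs
tone = 0.5 * np.sin(2 * np.pi * 1000 * t)  # 1 kHz tone with amplitude 0.5

freqs, amps = one_sided_spectrum(tone, fs)
peak = int(np.argmax(amps))
print(freqs[peak], amps[peak])
```

Using rfftfreq places every bin at its exact frequency, whereas a linspace from 0 to fs/2 over N//2 points stretches the axis very slightly.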
Let's read one audio sample, resample it, and listen to it. We can also compare the FFTs. Notice that there is almost no information above 4000 Hz in the original signal.
# Set new sample rate
new_sample_rate = 4000
# Read in the bird audio sample at its original sample rate
sample_rate, samples = wavfile.read(project.get_file('bird_0a7c2a8d_nohash_0.wav'))
# Resample to the new sample rate
resampled = signal.resample(samples, int(new_sample_rate/sample_rate * samples.shape[0]))
# Play resampled audio
ipd.Audio(resampled, rate=new_sample_rate)
Now, we want to visualize and compare the FFT graphs of the original and resampled audio files.
xf, vals = custom_fft(samples, sample_rate)
plt.figure(figsize=(12, 4))
plt.title('FFT of recording sampled with ' + str(sample_rate) + ' Hz')
plt.xlim(left=0, right=8000)
plt.plot(xf, vals)
plt.xlabel('Frequency')
plt.grid()
plt.show()
xf, vals = custom_fft(resampled, new_sample_rate)
plt.figure(figsize=(12, 4))
plt.title('FFT of recording sampled with ' + str(new_sample_rate) + ' Hz')
plt.xlim(left=0, right=8000)
plt.plot(xf, vals)
plt.xlabel('Frequency')
plt.grid()
plt.show()
In the FFT graph of the audio resampled to 4000 Hz, the spectrum is truncated at 2000 Hz. A signal sampled at 4000 Hz can only represent frequencies up to half the sample rate (the Nyquist frequency), so any sound above 2000 Hz is not included in the resampled audio. This explains why the resampled audio sounds muffled.
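The same effect can be reproduced on a synthetic signal, which makes the cutoff easy to verify. The sketch below is an illustration with made-up tones, not the dataset audio: it mixes a 1 kHz tone with a 3 kHz tone, resamples the mix from 16 kHz to 4 kHz with scipy.signal.resample, and checks what survives.

```python
import numpy as np
from scipy import signal

fs_original = 16000
fs_new = 4000
t = np.arange(fs_original) / fs_original

# Two unit-amplitude tones: 1 kHz lies below the new Nyquist frequency, 3 kHz above it
mix = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 3000 * t)

resampled_mix = signal.resample(mix, int(fs_new / fs_original * mix.shape[0]))

nyquist_hz = fs_new / 2
freqs_new = np.fft.rfftfreq(resampled_mix.shape[0], d=1.0 / fs_new)
dominant_hz = freqs_new[np.argmax(np.abs(np.fft.rfft(resampled_mix)))]

rms_before = np.sqrt(np.mean(mix ** 2))
rms_after = np.sqrt(np.mean(resampled_mix ** 2))
print(nyquist_hz, dominant_hz, rms_before, rms_after)
```

The 1 kHz tone is still present after resampling, while the energy carried by the 3 kHz tone is gone, so the overall level drops.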
In this section, we want to compare the differences between audio files.
First, let's visualize all audio files which have distinct labels.
file_name = ['bird_0a7c2a8d_nohash_0.wav', 'cat_0ab3b47d_nohash_0.wav', 'dog_0b09edd3_nohash_1.wav',
             'off_0ab3b47d_nohash_0.wav', 'on_0a7c2a8d_nohash_0.wav', 'right_0a7c2a8d_nohash_0.wav',
             'sheila_00f0204f_nohash_1.wav', 'up_0a7c2a8d_nohash_0.wav', 'zero_0c40e715_nohash_0.wav']
fig = plt.figure(figsize=(8,8))
fig.suptitle('Spectrogram', fontsize=16)
# For each of the samples
for i, filepath in enumerate(file_name):
    # Make subplots
    plt.subplot(3,3,i+1)
    # Pull the label from the file name
    label = filepath.split('_')[0]
    plt.title(label)
    # Create the spectrogram
    sample_rate, samples = wavfile.read(project.get_file(filepath))
    _, _, spectrogram = log_specgram(samples, sample_rate)
    plt.imshow(spectrogram.T, aspect='auto', origin='lower')
    # Hide the axes
    plt.axis('off')
# Plot the raw audio of the same files
fig = plt.figure(figsize=(8,13))
fig.suptitle('Raw Audio', fontsize=16)
for i, filepath in enumerate(file_name):
    plt.subplot(10,1,i+1)
    sample_rate, samples = wavfile.read(project.get_file(filepath))
    plt.title(filepath.split('_')[0])
    plt.axis('off')
    plt.plot(samples)
Next, let's visualize audio files that have the same label.
# Define bird files
file_name = [f for f in file_names if 'bird' in f]
fig = plt.figure(figsize=(8,8))
fig.suptitle('Spectrogram', fontsize=16)
for i, filepath in enumerate(file_name):
    # Make subplots
    plt.subplot(3,3,i+1)
    # Pull the label from the file name
    label = filepath.split('_')[0]
    plt.title(label)
    # Create the spectrogram
    sample_rate, samples = wavfile.read(project.get_file(filepath))
    _, _, spectrogram = log_specgram(samples, sample_rate)
    plt.imshow(spectrogram.T, aspect='auto', origin='lower')
    plt.axis('off')
fig = plt.figure(figsize=(8,13))
fig.suptitle('Raw Audio', fontsize=16)
for i, filepath in enumerate(file_name):
    plt.subplot(10,1,i+1)
    sample_rate, samples = wavfile.read(project.get_file(filepath))
    plt.title(filepath.split('_')[0])
    plt.axis('off')
    plt.plot(samples)
Now, we want to check whether any recordings are "outliers" that differ from all the others. We can lower the dimensionality of the dataset and interactively check for anomalies. Let's use Principal Component Analysis (PCA) for dimensionality reduction.
So, what is PCA? PCA is defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some scalar projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. In simple words, given a dataset with a number of features, PCA approximates the original features with fewer derived features that retain as much of the original variance as possible.
Reference: Principal Component Analysis (PCA)
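The definition above can be sketched with a singular value decomposition in plain NumPy. This is an illustration on made-up data, not the scikit-learn PCA class that this notebook imports, and the function name pca_project is hypothetical. The signs of the components may differ from those of other implementations.

```python
import numpy as np


def pca_project(data, n_components):
    """Project the rows of data onto the first n_components principal axes."""
    centered = data - data.mean(axis=0)  # PCA works on mean-centred data
    # The rows of vt are the principal axes, ordered by decreasing variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
# Toy data: 20 points with 5 features whose spread differs a lot between features
toy = rng.standard_normal((20, 5)) * np.array([10.0, 3.0, 1.0, 0.1, 0.1])

projected = pca_project(toy, 3)
variances = projected.var(axis=0)
print(projected.shape, variances)
```

The variance along each new coordinate never increases from one component to the next, which is the ordering the definition describes.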
To use PCA in this scenario, we zero-pad every audio sample that is shorter than one second so that all clips have the same length, compute the FFT of each clip, and normalize the results. We then run the PCA algorithm to reduce the dimensionality to three. While some of the information in the original files is lost, each file is compressed into a three-dimensional data point that we can use for comparison.
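Filling in zero values so that every clip has the same length can be sketched as below. The helper name fix_length is hypothetical, and truncating clips that are too long is an added assumption for robustness.

```python
import numpy as np


def fix_length(samples, target_len):
    """Zero-pad short clips and truncate long ones so each has target_len samples."""
    samples = np.asarray(samples)
    if samples.shape[0] < target_len:
        pad = np.zeros(target_len - samples.shape[0], dtype=samples.dtype)
        return np.concatenate([samples, pad])
    return samples[:target_len]


short_clip = np.ones(12000, dtype=np.int16)  # 0.75 s at 16 kHz
long_clip = np.ones(17000, dtype=np.int16)   # slightly over one second

padded = fix_length(short_clip, 16000)
cut = fix_length(long_clip, 16000)
print(padded.shape, cut.shape)
```

Equal lengths matter because PCA expects every row of its input matrix to have the same number of features.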
ffts, audio_names = [], []
for filepath in file_names:
    sample_rate, samples = wavfile.read(project.get_file(filepath))
    # Zero-pad clips shorter than one second so every FFT has the same length
    if samples.shape[0] < sample_rate:
        samples = np.append(samples, np.zeros((sample_rate - samples.shape[0], )))
    # Truncate any clip longer than one second
    samples = samples[:sample_rate]
    x, values = custom_fft(samples, sample_rate)
    ffts.append(values)
    audio_names.append(filepath)
# Set ffts from list type to array
ffts = np.array(ffts)
# Normalization: (Datapoint - mean)/standard deviation
ffts = (ffts - np.mean(ffts)) / np.std(ffts)
# Reduce the dimension to 3D
pca = PCA(n_components=3)
ffts = pca.fit_transform(ffts)
def interactive_3d_plot(data, names):
    scatt = go.Scatter3d(x=data[:, 0], y=data[:, 1], z=data[:, 2], mode='markers', text=names)
    # A plain list of traces works across plotly versions (go.Data is deprecated)
    data = [scatt]
    layout = go.Layout(title="Anomaly detection of Audio Samples")
    figure = go.Figure(data=data, layout=layout)
    py.iplot(figure)
interactive_3d_plot(ffts, audio_names)
From the 3-D graph, we can see that two audio files are a little different from all the others. This graph illustrates the concept using only 14 data files, so the result might not be very informative. Feel free to run this model with 100, 1000, or even more data files.
This notebook was created by the Center for Open-Source Data & AI Technologies.
Copyright © 2020 IBM. This notebook and its source code are released under the terms of the MIT License.